High-performance lock management for flash copy in n-way shared storage systems

ABSTRACT

A method, system, and machine-readable medium for providing high-performance lock management for a flash copy image of a region of data in N-way shared storage systems is disclosed. According to one embodiment, a data processing system is provided which comprises a cache to store a copy of metadata specifying a coherency relationship between a region of data and a flash copy image of the region of data, wherein the metadata is subject to one or more lock protocols controlled by an owner storage controller node; and a client storage controller node, coupled with the cache, comprising an input/output performing component to receive a request to perform an input/output operation on at least one of the region of data and the flash copy image of the region of data and to perform the input/output operation utilizing the copy of the metadata.

CROSS-REFERENCE TO RELATED APPLICATIONS

The present application hereby claims benefit of priority under 35 U.S.C. § 119 and § 365 to the previously-filed international patent application number PCT/GB2003/003567 entitled, “High-Performance Lock Management for Flash Copy in N-Way Shared Storage Systems”, filed on Aug. 14, 2003, naming Carlos Francisco Fuente and William James Scales as inventors, assigned to the assignee of the present application, and having a priority date of Nov. 29, 2002 based upon United Kingdom Patent Application No. 0227825.7 which are both herein incorporated by reference in their entirety and for all purposes.

BACKGROUND

1. Technical Field

The present invention relates to the field of computer storage controllers, and particularly to advanced function storage controllers in n-way shared storage systems providing a Flash Copy function.

2. Description of the Related Art

In the field of computer storage systems, there is increasing demand for what have come to be described as “advanced functions”. Such functions go beyond the simple I/O functions of conventional storage controller systems. Advanced functions are well known in the art and depend on the control of metadata used to retain state data about the real or “user” data stored in the system. The manipulations available using advanced functions enable various actions to be applied quickly to virtual images of data, while leaving the real data available for use by user applications. One such well-known advanced function is ‘Flash Copy’.

At the highest level, Flash Copy is a function where a second image of ‘some data’ is made available. This function is sometimes known as Point-In-Time copy, or T0-copy. The second image's contents are initially identical to those of the first. The second image is made available ‘instantly’. In practical terms this means that the second image is made available in much less time than would be required to create a true, separate, physical copy, and that it can be established without unacceptable disruption to a using application's operation.

Once established, the second copy can be used for a number of purposes including performing backups, system trials and data mining. The first copy continues to be used for its original purpose by the original using application. Contrast this with backup without Flash Copy, where the application must be shut down, and the backup taken, before the application can be restarted again. It is becoming increasingly difficult to find time windows where an application is sufficiently idle to be shut down. The cost of taking a backup is increasing. There is significant and increasing business value in the ability of Flash Copy to allow backups to be taken without stopping the business.

Flash Copy implementations achieve the illusion of the existence of a second image by redirecting read I/O addressed to the second image (henceforth Target) to the original image (henceforth Source), unless that region has been subject to a write. Where a region has been the subject of a write (to either Source or Target), then to maintain the illusion that both Source and Target own their own copy of the data, a process is invoked which suspends the operation of the write command, and without it having taken effect, issues a read of the affected region from the Source, applies the read data to the Target with a write, then (and only if all steps were successful) releases the suspended write. Subsequent writes to the same region do not need to be suspended since the Target will already have its own copy of the data. This copy-on-write technique is well known and is used in many environments.

All implementations of Flash Copy rely on a data structure which governs the decisions discussed above, namely, the decision as to whether reads received at the Target are issued to the Source or the Target, and the decision as to whether a write must be suspended to allow the copy-on-write to take place. The data structure essentially tracks the regions or grains of data that have been copied from source to target, as distinct from those that have not.

Maintenance of this data structure (hereinafter called metadata) is key to implementing the algorithm behind Flash Copy.
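By way of illustration only, the read redirection and copy-on-write decisions described above can be sketched for a single controller as follows. The sketch assumes in-memory lists standing in for disks, whole-grain writes, and a per-grain boolean list as the metadata; the class and method names are hypothetical and are not taken from any embodiment.

```python
class FlashCopy:
    """Toy single-controller Flash Copy over in-memory 'disks' split into grains."""

    def __init__(self, source):
        self.source = list(source)            # Source image, one entry per grain
        self.target = [None] * len(source)    # Target holds no physical data at first
        self.copied = [False] * len(source)   # metadata: has this grain been copied?

    def read_source(self, grain):
        return self.source[grain]

    def read_target(self, grain):
        # Target reads are redirected to the Source until the grain has been copied.
        return self.target[grain] if self.copied[grain] else self.source[grain]

    def _copy_on_write(self, grain):
        # Before the first write takes effect, give the Target its own copy.
        if not self.copied[grain]:
            self.target[grain] = self.source[grain]
            self.copied[grain] = True

    def write_source(self, grain, data):
        self._copy_on_write(grain)
        self.source[grain] = data

    def write_target(self, grain, data):
        self._copy_on_write(grain)
        self.target[grain] = data
```

In this sketch a write to either image triggers at most one copy per grain, after which the metadata entry for that grain stays set and later writes proceed directly.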

Flash Copy is relatively straightforward to implement within a single CPU complex (possibly with SMP processors), as is often employed within modern storage controllers. With a little more effort, it is possible to implement fault tolerant Flash Copy, such that (at least) two CPU complexes have access to a copy of the metadata. In the event of a failure of the first CPU complex, the second can be used to continue operation, without loss of access to the Target Image.

However, the I/O capability of a single CPU complex is limited. Though improving, the capability of a single CPU complex, measured in terms of either I/Os per second or bandwidth (MB/s), has a finite limit, and will eventually impose a constraint on the performance of the using applications. This limit arises in many implementations of Flash Copy, but a good example is in Storage Controllers. A typical storage controller has a single CPU complex (or possibly a redundant pair), which dictates a limit on the performance capability of that controller.

More storage controllers can be installed. But the separate storage controllers do not share access to the metadata, and therefore do not cooperate in managing a Flash Copy image. The storage space becomes fragmented, with Flash Copy being confined to the scope of a single controller system. Both Source and Target disks must be managed within the same storage controller. A single storage controller disk space might become full, while another has some spare space, but it is not possible to separate the Source and Target disks, placing the Target disk under the control of the new controller. (This is particularly unfortunate in the case of a new Flash Copy, where moving the Target is a cheap operation, as it has no physical data associated with it).

As well as constraining the total performance possible for a Source/Target pair, the constraint of single-controller storage functions adds complexity to the administration of the storage environment.

Typically, storage control systems today do not attempt to solve this problem. They implement Flash Copy techniques that are confined to a single controller, and hence are constrained by the capability of that controller.

A simple way of allowing multiple controllers to participate in a shared Flash Copy relationship is to assign one controller as the Owner of the metadata, and have the other controllers forward all read and write requests to that controller. The owning controller processes the I/O requests as if they came directly from its own attached host servers, using the algorithm described above, and completes each I/O request back to the originating controller.

The main drawback of such a system, and the reason that it is not widely used, is that the burden of forwarding each I/O request is too great, possibly even doubling the total system-wide cost, and hence approximately halving the system performance.

It is known, for example, in the area of distributed parallel database systems, to have a distributed lock management structure employing a two-phase locking protocol to hold locks on data in order to maintain any copies of the data in a coherency relation. However, two-phase locking is typically time-consuming and adds a considerable messaging burden to the processing. As such, an unmodified two-phase locking protocol according to the prior art is disadvantageous in systems at a lower level in the software and hardware stack, such as storage area networks having distributed storage controllers where the performance impact of the passing of locking control messages is even more significant than it is at the database control level.

It would therefore be desirable to gain the advantages of distributed lock management in a Flash Copy environment while incurring the minimum lock messaging overhead.

BRIEF SUMMARY

The present invention accordingly provides a method, system, and machine-readable medium for providing high-performance lock management for a flash copy image of a region of data in N-way shared storage systems. According to one embodiment, a data processing system is provided which comprises a cache to store a copy of metadata specifying a coherency relationship between a region of data and a flash copy image of the region of data, wherein the metadata is subject to one or more lock protocols controlled by an owner storage controller node; and a client storage controller node, coupled with the cache, comprising an input/output performing component to receive a request to perform an input/output operation on at least one of the region of data and the flash copy image of the region of data and to perform the input/output operation utilizing the copy of the metadata.

According to another embodiment, a method is provided which comprises storing a copy of metadata specifying a coherency relationship between a region of data and a flash copy image of the region of data within a cache, wherein the metadata is subject to one or more lock protocols controlled by an owner storage controller node; receiving a request to perform an input/output operation on at least one of the region of data and the flash copy image of the region of data at a client storage controller node; and performing the input/output operation utilizing an input/output performing component of the client storage controller node and the copy of the metadata.

According to yet another embodiment, a machine-readable medium is provided having a plurality of instructions executable by a machine embodied therein, wherein the plurality of instructions, when executed, cause said machine to perform the method previously described herein.

The foregoing is a summary and thus contains, by necessity, simplifications, generalizations and omissions of detail; consequently, those skilled in the art will appreciate that the summary is illustrative only and is not intended to be in any way limiting. As will also be apparent to one of skill in the art, the operations disclosed herein may be implemented in a number of ways including implementation in hardware, i.e. ASICs and special purpose electronic circuits, and such changes and modifications may be made without departing from this invention and its broader aspects. Other aspects, inventive features, and advantages of the present invention, as defined solely by the claims, will become apparent in the non-limiting detailed description set forth below.

BRIEF DESCRIPTION OF THE DRAWINGS

A preferred embodiment of the present invention will now be described by way of example only, with reference to the accompanying drawings, in which:

FIG. 1 is a flow diagram illustrating one embodiment of a two-phase locking scheme using lock messages to control coherency between a region of data and a Flash Copy image of the data;

FIG. 2 shows system components of a system according to one embodiment of the present invention; and

FIG. 3 shows additional process operations of an alternative embodiment of the present invention.

DETAILED DESCRIPTION OF AN ILLUSTRATIVE EMBODIMENT

For better understanding of the presently described illustrative embodiments of the present invention, it is necessary to describe the use of two-phase lock messaging to co-ordinate activity between plural storage controllers (or nodes) in an n-way storage system.

As an example, consider an n-way system implementing Flash Copy. Assume every node has access to the storage managed by the co-operating set of n nodes. One of the nodes is designated as an owner (process block 102) for metadata relating to all I/O relationships of a region of storage. The other nodes are designated as clients. In one embodiment, one of the client nodes is further designated as a backup owner and maintains a copy of the metadata in order to provide continuous availability in the event of a failure of the owner node.

Consider a host I/O request arriving (process block 104) at a particular client node (‘C’). Suppose that the host I/O request is either a Read or Write of the Target disk, or possibly a Write of the Source disk. Client C begins processing by suspending (process block 106) the I/O. C then sends (process block 108) a message REQ to the Owner node O, asking if the grain has been copied.

On receipt of message REQ, O inspects its own metadata structures. If O finds that the region has already been copied, O replies (process block 110) with a NACK message. If O finds that the region has not already been copied, it places a lock record against the appropriate metadata for the region within its own metadata structures, and replies (process block 112) with a GNT message. The lock record is required to ensure compatibility between the request just received and granted, and further requests that might arrive affecting the same metadata while the processing at C continues. Various techniques to maintain the lock record and to define the compatibility constraints as if the I/O had been received locally by O may be implemented in embodiments of the present invention.
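The Owner's handling of a REQ message, as just described, might be sketched as follows. This is a minimal illustration under stated assumptions: the `Owner` class, its fields, and in particular the `"PENDING"` return value for an incompatible request are hypothetical, since the description leaves the lock record format and the compatibility constraints to the embodiment.

```python
class Owner:
    """Owner-side handling of REQ; names and structure are illustrative only."""

    def __init__(self, num_regions):
        self.copied = [False] * num_regions   # metadata: region already copied?
        self.locks = {}                       # region -> client holding the lock record

    def handle_req(self, client, region):
        if self.copied[region]:
            return "NACK"                     # already copied: no lock is needed
        if region in self.locks:
            # Assumption: an incompatible request is merely reported as pending.
            # A real embodiment would apply its own compatibility rules here.
            return "PENDING"
        self.locks[region] = client           # place the lock record
        return "GNT"
```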

On receipt of a NACK message, C unpends (process block 114) the original I/O request. On receipt of the GNT message, C continues (process block 116) by performing the data transfer or transfers required by the Flash Copy algorithm. In the case of a Target Read, this means performing the read to the source disk. Some time later, C will indicate completion (process block 118) of the read request, and will issue (process block 120) an UNL message to O, at the same time as completing the original I/O request to the host system that issued it.

O, on receipt of an UNL message, removes (process block 122) the lock record from its metadata table, thus possibly releasing further I/O requests that were suspended because of that lock. According to one embodiment, O then delivers (process block 124) a UNLD message to C, allowing C to reuse the resources associated with the original request. This is, however, not required by the Flash Copy algorithm itself.

In the case of a write (to either Target or Source) C performs the copy-on-write (process block 127). Having completed all steps of the copy-on-write, and with the original write I/O request still suspended, C issues (process block 126) an UNLC request to O. O, on receipt of an UNLC message, marks (process block 128) in metadata the region affected as having been copied, removes (process block 130) the lock record, informs (process block 132) any waiting requests that the area has now been copied, and then issues (process block 134) an UNLD message to C. C, on receipt of a UNLD message, releases (process block 136) the suspended write operation, which will some time later complete, and then C completes (process block 138) the write operation to the host. According to another embodiment of the present invention, recovery paths are required to cater for the situation where a disk I/O fails, or the messaging system fails, or a node fails.

The above description was from the point of view of a single I/O, and a single Client C. Embodiments of the present invention may be implemented however in the presence of multiple I/Os, from multiple client nodes, with O continuing to process all requests utilizing the same algorithm.
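As a non-authoritative sketch, the message exchange for a single host write under the unmodified protocol can be traced as follows. The sketch assumes direct function calls in place of a messaging system, omits the data transfers, recovery paths, and the handling of conflicting locks, and uses hypothetical names (`Owner.handle`, `client_write`) throughout.

```python
class Owner:
    """Owner metadata and lock records for one Flash Copy relationship."""

    def __init__(self, num_regions):
        self.copied = [False] * num_regions
        self.locks = set()

    def handle(self, msg, region):
        if msg == "REQ":
            if self.copied[region]:
                return "NACK"
            self.locks.add(region)           # place the lock record
            return "GNT"
        if msg == "UNL":                     # read finished: metadata unchanged
            self.locks.discard(region)
            return "UNLD"
        if msg == "UNLC":                    # copy-on-write finished
            self.copied[region] = True       # mark the region as copied
            self.locks.discard(region)
            return "UNLD"
        raise ValueError(f"unknown message: {msg}")


def client_write(owner, region, log):
    """Client-side handling of one host write; `log` records messages in order."""
    log.append("REQ")
    reply = owner.handle("REQ", region)
    log.append(reply)
    if reply == "GNT":
        # The copy-on-write transfers (read Source, write Target) would occur here.
        log.append("UNLC")
        log.append(owner.handle("UNLC", region))
    # On NACK or UNLD the suspended write is released to the disk.
```

In this sketch the first write to a region costs four messages (REQ, GNT, UNLC, UNLD), and every later write to the same region still costs two (REQ, NACK).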

Turning now to FIG. 2, there is shown an apparatus in accordance with an embodiment of the present invention. The depicted apparatus is embodied within a storage controller network 200 comprising an Owner 202, a Client 204 I/O performing component, a portion of metadata 206 relating to data 208 held under the control of the storage network, a copy 209 of the data 208, and communication means. The apparatus includes an ownership assignment component 210 to assign ownership of metadata to Owner 202, and a lock management component 212 operable to control locking at a metadata 206 level during I/O activity to ensure coherency with any copy 209. Included also is a messaging component 214 at Owner 202. In the depicted embodiment, messaging component 214 is operable to pass one or more messages between Client 204 and Owner 202 to request a response regarding a metadata state, grant a lock, request release of a lock, and/or signal that a lock has been released. Client 204 is similarly operable in the illustrated embodiment to perform I/O on data whose metadata is owned by any owner (e.g., Owner 202), subject to the compliance of Client 204 with the lock protocols at the metadata level controlled by that owner.

The system and method thus described are capable of handling distributed lock management in an n-way shared storage controller network, but require considerable messaging overhead in the system to operate. This is not burdensome in systems containing relatively few controllers or where there is relatively little activity, but in other modern storage systems, such as very large storage area networks, there are likely to be many controllers and a very high level of storage activity. Under such circumstances, the avoidance of unnecessary messaging overhead would be advantageous.

Thus, in another embodiment of the present invention, each client node is provided with the ability to maintain information which records the last response received from an Owner. Specifically (described in terms of additions to FIG. 1 according to FIG. 3), a client node C is permitted to cache (process block 308) data indicating that a NACK message was received (after process block 114 of FIG. 1), or that it itself issued and had acknowledged an UNLC/UNLD message pair (at process block 126 and after process block 134 of FIG. 1).

On receipt (process block 302) of a host I/O request as at process block 104 of FIG. 1, Client C now applies a modified lock control algorithm, as follows. C first inspects its cached data (process block 303), to see if it has a positive indication that the region affected has already been copied. If it has, then it continues (process block 304) with the I/O without sending any protocol messages to O. If the cache contains no such positive indication, the unmodified protocol described above (and illustrated herein at process block 106 et seq. of FIG. 1) is used. The receipt (process block 306) of a NACK or an UNLC/UNLD pair causes the cached information to be updated (process block 308), and subsequent I/Os that affect that region, finding this information in the cache (process block 303), can proceed (process block 304) without issuing any protocol messages.
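The modified client algorithm can be illustrated with the following sketch. It assumes the Owner is reachable through a callable that answers a REQ with either "NACK" or "GNT"; the class `CachingClient`, its method names, and the message counter are hypothetical and exist only to make the saving in protocol messages visible.

```python
class CachingClient:
    """Client that caches positive 'already copied' indications per region."""

    def __init__(self, send_req):
        self.send_req = send_req      # callable(region) -> "NACK" or "GNT"
        self.known_copied = set()     # positive indications only
        self.messages_sent = 0

    def handle_io(self, region):
        if region in self.known_copied:
            return "proceed"          # cache hit: no protocol messages to the Owner
        self.messages_sent += 1
        reply = self.send_req(region)
        if reply == "NACK":
            self.known_copied.add(region)   # Owner confirmed the region is copied
            return "proceed"
        return "locked"               # GNT: perform transfers, then UNLC/UNLD

    def on_unld_after_unlc(self, region):
        # The client's own copy-on-write was acknowledged by the Owner.
        self.known_copied.add(region)
```

The cache holds only positive indications, so a missing entry never causes incorrect behaviour in this sketch; it merely causes the unmodified protocol to be used.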

The term ‘pessimistic cache’ is sometimes used to describe the approach needed in the presently described embodiment of the present invention. This means that a client need not be fully up-to-date with the Owner's metadata; a client may believe an area needs to be copied, and be corrected by the owner (NACK) to say it does not. However, a client should not believe that an area has been copied when a corresponding owner knows it has not.

The lock caching of the presently described embodiment requires certain changes to the client for correct operation. First, the cache must be initialized (process block 301) (e.g., to indicate that all regions must be copied) each time a Flash Copy relationship is started (process block 300 a). This might be driven in a number of ways, but a message from Owner to Client is the most straightforward to implement. Second, any time a client node might have missed a message (process block 300 b) indicating that the cache has been reinitialized (perhaps because of a network disturbance), the client must assume the worst case and reinitialize (process block 301) or revalidate its cache.

Further extensions and variations are possible, as will be clear to one skilled in the art. For example, the cached information is discardable, as it can always be recovered from the owner node, which has the only truly up-to-date copy. Thus, the client could have less metadata space allocated for caching information than would be required to store all the metadata held on all the nodes. The clients could then rely on locality of access for the I/Os they process to ensure that they continue to benefit from the caching of the lock message information.
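One possible shape for such a discardable, space-limited client cache is sketched below. The least-recently-used eviction policy is an assumption chosen here to exploit locality of access; the description does not prescribe any particular policy, and the class and method names are hypothetical. The `reinitialize` method corresponds to resetting the cache when a Flash Copy relationship is started or a reinitialization message may have been missed.

```python
from collections import OrderedDict


class CopiedRegionCache:
    """Bounded cache of regions known to be copied; entries may be discarded."""

    def __init__(self, capacity):
        self.capacity = capacity
        self._entries = OrderedDict()   # region -> True, oldest first

    def reinitialize(self):
        # Worst-case assumption: no region is known to have been copied.
        self._entries.clear()

    def note_copied(self, region):
        self._entries[region] = True
        self._entries.move_to_end(region)
        while len(self._entries) > self.capacity:
            self._entries.popitem(last=False)   # discard least recently used

    def is_known_copied(self, region):
        if region in self._entries:
            self._entries.move_to_end(region)   # refresh recency on a hit
            return True
        return False
```

Discarding an entry is always safe in this sketch, because a miss only sends the client back to the Owner with a REQ message.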

In a further extended embodiment, a NACK message (and also the GNT or UNLD messages) can carry back more information than that relating to the region being directly processed by the REQ/GNT/UNLC/UNLD messages. Information concerning neighboring regions that have also been cleaned can be sent from owners to clients.
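A minimal sketch of this extension, for the NACK case only, follows. The reply layout (a dictionary with an `also_copied` list), the `span` parameter bounding how many neighbours are inspected, and the function names are all assumptions made for illustration; the description does not define a message format.

```python
def build_nack(copied, region, span=2):
    """Owner side: NACK for `region`, plus nearby regions that are already copied."""
    lo = max(0, region - span)
    hi = min(len(copied), region + span + 1)
    neighbours = [r for r in range(lo, hi) if copied[r] and r != region]
    return {"type": "NACK", "region": region, "also_copied": neighbours}


def apply_reply(cache, reply):
    """Client side: record every positive indication carried by the reply."""
    if reply["type"] == "NACK":
        cache.add(reply["region"])
    cache.update(reply["also_copied"])
```

A single NACK can then pre-populate the client cache for several adjacent regions, which may save later REQ messages when I/O is sequential.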

It will be appreciated that the method described above will typically be carried out in software running on one or more processors (not shown), and that the software may be provided as a computer program element carried on any suitable data carrier (also not shown) such as a magnetic or optical computer disc. The channels for the transmission of data likewise may include storage media of all descriptions as well as signal carrying media, such as wired or wireless signal media.

The present invention may suitably be embodied as a computer program product for use with a computer system. Such an implementation may comprise a series of computer readable instructions either fixed on a tangible medium, such as a computer readable medium, for example, diskette, CD-ROM, ROM, or hard disk, or transmittable to a computer system, via a modem or other interface device, over either a tangible medium, including but not limited to optical or analog communications lines, or intangibly using wireless techniques, including but not limited to microwave, infrared or other transmission techniques. The series of computer readable instructions may embody all or part of the functionality previously described herein.

Those skilled in the art will appreciate that such computer readable instructions can be written in a number of programming languages for use with many computer architectures or operating systems. Further, such instructions may be stored using any memory technology, present or future, including but not limited to, semiconductor, magnetic, or optical, or transmitted using any communications technology, present or future, including but not limited to optical, infrared, or microwave. It is contemplated that such a computer program product may be distributed as a removable medium with accompanying printed or electronic documentation, for example, shrink-wrapped software, pre-loaded with a computer system, for example, on a system ROM or fixed disk, or distributed from a server or electronic bulletin board over a network, for example, the Internet or World Wide Web.

It will be appreciated that various modifications to the embodiment described above will be apparent to a person of ordinary skill in the art.

1. A data processing system comprising: a cache to store a copy of metadata specifying a coherency relationship between a region of data and a flash copy image of said region of data, wherein said metadata is subject to one or more lock protocols controlled by an owner storage controller node; and a client storage controller node, coupled with said cache, comprising an input/output performing component to receive a request to perform an input/output operation on at least one of said region of data and said flash copy image of said region of data and to perform said input/output operation utilizing said copy of said metadata.
2. The data processing system of claim 1, further comprising: a messaging component, coupled between said client storage controller node and said owner storage controller node, to pass at least one of: a message to request a lock, a message to grant a lock, a message to request release of a lock, and a message to signal that a lock has been released.
3. The data processing system of claim 1, wherein said copy of said metadata comprises, a previous positive confirmation that said region of data and a flash copy comprising said flash copy image of said region of data are consistent.
4. The data processing system of claim 3, wherein said input/output performing component is operable to discard said previous positive confirmation.
5. The data processing system of claim 3, wherein said previous positive confirmation further comprises a positive confirmation that a further region of data, contiguous with said region of data, is consistent with said flash copy.
6. The data processing system of claim 1, further comprising: a cache storage area to store said cache, wherein said input/output performing component is operable to selectively discard said copy of said metadata, and said cache storage area is reduced as a result of said copy of said metadata being discarded.
7. A method comprising: storing a copy of metadata specifying a coherency relationship between a region of data and a flash copy image of said region of data within a cache, wherein said metadata is subject to one or more lock protocols controlled by an owner storage controller node; receiving a request to perform an input/output operation on at least one of said region of data and said flash copy image of said region of data at a client storage controller node; and performing said input/output operation utilizing an input/output performing component of said client storage controller node and said copy of said metadata.
8. The method of claim 7, further comprising: transferring at least one of: a message to request a lock, a message to grant a lock, a message to request release of a lock, and a message to signal that a lock has been released between said client storage controller node and said owner storage controller node utilizing a messaging component.
9. The method of claim 7, wherein said copy of said metadata comprises, a previous positive confirmation that said region of data and a flash copy comprising said flash copy image of said region of data are consistent.
10. The method of claim 9, further comprising: discarding said previous positive confirmation utilizing said input/output performing component.
11. The method of claim 9, wherein said previous positive confirmation further comprises a positive confirmation that a further region of data, contiguous with said region of data, is consistent with said flash copy.
12. The method of claim 7, further comprising: selectively discarding said copy of said metadata to maintain a reduced cache storage area including said cache.
13. A machine-readable medium having a plurality of instructions executable by a machine embodied therein, wherein said plurality of instructions, when executed, cause said machine to perform a method comprising: storing a copy of metadata specifying a coherency relationship between a region of data and a flash copy image of said region of data within a cache, wherein said metadata is subject to one or more lock protocols controlled by an owner storage controller node; and receiving a request to perform an input/output operation on at least one of said region of data and said flash copy image of said region of data at a client storage controller node; and performing said input/output operation utilizing an input/output performing component of said client storage controller node and said copy of said metadata.
14. The machine-readable medium of claim 13, said method further comprising: transferring at least one of: a message to request a lock, a message to grant a lock, a message to request release of a lock, and a message to signal that a lock has been released between said client storage controller node and said owner storage controller node utilizing a messaging component.
15. The machine-readable medium of claim 13, wherein said copy of said metadata comprises, a previous positive confirmation that said region of data and a flash copy comprising said flash copy image of said region of data are consistent.
16. The machine-readable medium of claim 15, said method further comprising: discarding said previous positive confirmation utilizing said input/output performing component.
17. The machine-readable medium of claim 15, wherein said previous positive confirmation further comprises a positive confirmation that a further region of data, contiguous with said region of data, is consistent with said flash copy.
18. The machine-readable medium of claim 13, said method further comprising: selectively discarding said copy of said metadata to maintain a reduced cache storage area including said cache.